Translating Dialectal Arabic to English
نویسندگان
چکیده
We present a dialectal Egyptian Arabic to English statistical machine translation system that leverages dialectal to Modern Standard Arabic (MSA) adaptation. In contrast to previous work, we first narrow down the gap between Egyptian and MSA by applying an automatic characterlevel transformational model that changes Egyptian to EG′, which looks similar to MSA. The transformations include morphological, phonological and spelling changes. The transformation reduces the out-of-vocabulary (OOV) words from 5.2% to 2.6% and gives a gain of 1.87 BLEU points. Further, adapting large MSA/English parallel data increases the lexical coverage, reduces OOVs to 0.7% and leads to an absolute BLEU improvement of 2.73 points.
منابع مشابه
Machine Translation of Arabic Dialects
Arabic Dialects present many challenges for machine translation, not least of which is the lack of data resources. We use crowdsourcing to cheaply and quickly build LevantineEnglish and Egyptian-English parallel corpora, consisting of 1.1M words and 380k words, respectively. The dialectal sentences are selected from a large corpus of Arabic web text, and translated using Amazon’s Mechanical Tur...
متن کاملDIRA: Dialectal Arabic Information Retrieval Assistant
DIRA is a query expansion tool that generates search terms in Standard Arabic and/or its dialects when provided with queries in English or Standard Arabic. The retrieval of dialectal Arabic text has recently become necessary due to the increase of dialectal content on social media. DIRA addresses the challenges of retrieving information in Arabic dialects, which have significant linguistic diff...
متن کاملHandling OOV Words in Dialectal Arabic to English Machine Translation
Dialects and standard forms of a language typically share a set of cognates that could bear the same meaning in both varieties or only be shared homographs but serve as faux amis. Moreover, there are words that are used exclusively in the dialect or the standard variety. Both phenomena, faux amis and exclusive vocabulary, are considered out of vocabulary (OOV) phenomena. In this paper, we prese...
متن کاملMulti-Lingual Phrase-Based Statistical Machine Translation for Arabic-English
In this paper, we implement a multilingual Statistical Machine Translation (SMT) system for Arabic-English Translation. Arabic Text can be categorized into standard and dialectal Arabic. These two forms of Arabic differ significantly. Different mono-lingual and multi-lingual hybrid SMT approaches are compared. Mono-lingual systems do always result in better translation accuracy in one Arabic fo...
متن کاملSentiment Lexicons for Arabic Social Media
Existing Arabic sentiment lexicons have low coverage—only a few thousand entries. In this paper, we present several large sentiment lexicons that were automatically generated using two different methods: (1) by using distant supervision techniques on Arabic tweets, and (2) by translating English sentiment lexicons into Arabic using a freely available statistical machine translation system. We c...
متن کامل